Hotel Booking Analysis

Maftei Alexandru, Vasile Catalina

Introduction

The dataset contains information on bookings in two types of hotels, both located in Portugal: a resort hotel and a city hotel. Each observation represents a hotel reservation. The data set includes reservations between July 1, 2015 and August 31, 2017, including those that were canceled. Because these are real hotel data, all information identifying the hotels or the clients has been removed. Given the scarcity of real business data available for scientific and educational purposes, data sets like this one can play an important role in research and education in revenue management, machine learning, data mining and other fields.

Every year, more than 140 million bookings are made on the Internet, and hotel booking cancellations are a growing problem. An analysis of the last 5 years showed that the cancellation rate has reached almost 40%, a trend that has a very negative impact on hotels' revenue and distribution management strategies.

It is therefore useful to try to understand, and even predict, which guests are more likely to cancel their bookings, by extracting insights from the data set and discovering which features contribute most to cancellations.

The dataset

Preparation of the data set

## [1] 4
## [1] 0
## 'data.frame':    119390 obs. of  32 variables:
##  $ hotel                         : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : Factor w/ 12 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
##  $ country                       : Factor w/ 178 levels "ABW","AGO","AIA",..: 137 137 60 60 60 60 137 137 137 137 ...
##  $ market_segment                : Factor w/ 8 levels "Aviation","Complementary",..: 4 4 4 3 7 7 4 4 7 6 ...
##  $ distribution_channel          : Factor w/ 5 levels "Corporate","Direct",..: 2 2 2 1 4 4 2 2 4 4 ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : Factor w/ 10 levels "A","B","C","D",..: 3 3 1 1 1 1 3 3 1 4 ...
##  $ assigned_room_type            : Factor w/ 12 levels "A","B","C","D",..: 3 3 3 1 1 1 3 3 1 4 ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ agent                         : Factor w/ 334 levels "1","10","103",..: 334 334 334 157 103 103 334 156 103 40 ...
##  $ company                       : Factor w/ 353 levels "10","100","101",..: 353 353 353 353 353 353 353 353 353 353 ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : Factor w/ 3 levels "Canceled","Check-Out",..: 2 2 2 2 2 2 2 2 1 1 ...
##  $ reservation_status_date       : Factor w/ 926 levels "2014-10-17","2014-11-18",..: 122 122 123 123 124 124 124 124 73 62 ...

Exploratory Data Analysis

We will try to answer questions such as:

We will also try to find different relationships between the attributes.

Graphics

Most of the cancellations are made at city hotels.

However, clients still prefer city hotels instead of resorts.

## 
##   City Hotel Resort Hotel 
##        79330        40060
##    
##     City Hotel Resort Hotel
##   0      46228        28938
##   1      33102        11122
##    mydataset$adults length(is_canceled)
## 1                 0                 403
## 2                 1               23027
## 3                 2               89680
## 4                 3                6202
## 5                 4                  62
## 6                 5                   2
## 7                 6                   1
## 8                10                   1
## 9                20                   2
## 10               26                   5
## 11               27                   2
## 12               40                   1
## 13               50                   1
## 14               55                   1
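
The hotel-by-cancellation counts printed above are absolute numbers, so they mix the size of each hotel with its tendency to lose bookings. A minimal sketch of how they can be turned into cancellation rates is given below; the counts are copied from the table, while the dictionary layout and the function name are illustrative choices and not part of the original analysis.

```python
# Counts copied from the hotel x is_canceled table above
# (row 0 = not canceled, row 1 = canceled).
counts = {
    "City Hotel": {"kept": 46228, "canceled": 33102},
    "Resort Hotel": {"kept": 28938, "canceled": 11122},
}


def cancellation_rate(kept, canceled):
    """Return the share of bookings that were canceled."""
    return canceled / (kept + canceled)


# Rate per hotel type.
rates = {
    hotel: cancellation_rate(c["kept"], c["canceled"])
    for hotel, c in counts.items()
}

# Overall rate across both hotels.
total_kept = sum(c["kept"] for c in counts.values())
total_canceled = sum(c["canceled"] for c in counts.values())
overall_rate = cancellation_rate(total_kept, total_canceled)

for hotel, rate in rates.items():
    print(f"{hotel}: {rate:.1%}")
print(f"Overall: {overall_rate:.1%}")
```

Comparing rates rather than counts shows whether a group cancels more often, and not merely whether it books more often.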

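Bookings for two adults are by far the most common (89,680 of 119,390), so couples also account for the largest number of cancellations. Note that the table above counts all bookings per group, so on its own it does not show a higher cancellation rate for couples.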

The transient tourists are more likely to cancel their bookings.

##   mydataset$market_segment length(is_canceled)
## 1                 Aviation                 237
## 2            Complementary                 743
## 3                Corporate                5295
## 4                   Direct               12606
## 5                   Groups               19811
## 6            Offline TA/TO               24219
## 7                Online TA               56477
## 8                Undefined                   2

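The table counts all bookings per market segment: Online TA is the largest segment (56,477 of 119,390 bookings), and it also has the greatest number of cancellations.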

##    mydataset$reserved_room_type length(is_canceled)
## 1                             A               85994
## 2                             B                1118
## 3                             C                 932
## 4                             D               19201
## 5                             E                6535
## 6                             F                2897
## 7                             G                2094
## 8                             H                 601
## 9                             L                   6
## 10                            P                  12

Rooms of type A are reserved most often (85,994 bookings) and have the greatest number of cancellations.

##    mydataset$arrival_date_month length(is_canceled)
## 1                         April               11089
## 2                        August               13877
## 3                      December                6780
## 4                      February                8068
## 5                       January                5929
## 6                          July               12661
## 7                          June               10939
## 8                         March                9794
## 9                           May               11791
## 10                     November                6794
## 11                      October               11160
## 12                    September               10508

July and August are also the months with the greatest number of cancellations.

## PRT    0.406986
## GBR    0.101591
## FRA    0.087235
## ESP    0.071765
## DEU    0.061035
## Name: country, dtype: float64

[Figure: "Most Popular Countries of Origin of the Guests"; x-axis: Country]

People from almost all over the world choose to spend their holidays in these two hotels. The largest share, about 41%, comes from Portugal, followed by other European countries such as Great Britain (about 10%) and France (about 9%).

## August       0.116233
## July         0.106047
## May          0.098760
## October      0.093475
## April        0.092880
## June         0.091624
## September    0.088014
## March        0.082034
## February     0.067577
## November     0.056906
## December     0.056789
## January      0.049661
## Name: arrival_date_month, dtype: float64

[Figure: "Most Occupied (Busiest) Month with Bookings"; x-axis: Month]

August is the month with the highest number of bookings (about 11.6% of the total), followed by July, while January is the least busy month.

## 0    0.629584
## 1    0.370416
## Name: is_canceled, dtype: float64

[Figure: pie chart "Proportion of Cancelled & Not Cancelled Bookings"; Not Cancelled 63.0%, Cancelled 37.0%]

This pie chart shows the proportion of cancelled and not cancelled bookings. 37% of the bookings were cancelled, a high percentage which suggests that some measures should be taken.

## Online TA        0.473046
## Offline TA/TO    0.202856
## Groups           0.165935
## Direct           0.105587
## Corporate        0.044350
## Complementary    0.006223
## Aviation         0.001985
## Undefined        0.000017
## Name: market_segment, dtype: float64

[Figure: "Total Number of Bookings by Market Segment"; x-axis: Market Segment]

Almost half of the bookings (47%) are made through online travel agents, and only about 11% are made directly by tourists.

## Transient          0.750591
## Transient-Party    0.210436
## Contract           0.034140
## Group              0.004833
## Name: customer_type, dtype: float64

[Figure: "Total Number of Bookings by Customer Type"; x-axis labelled "Market Segment" in the original plot]

This plot shows that 75% of the bookings are transient bookings, 21% are transient-party and about 3% are contract bookings.

## 0    0.588977
## 1    0.278298
## 2    0.108627
## 3    0.020915
## 4    0.002848
## 5    0.000335
## Name: total_of_special_requests, dtype: float64

[Figure: "Total Special Request"; x-axis: Number of Special Request]

Almost 60% of the bookings come with no special requests.

[Figure: "Room price per night and person over the year"; x-axis: Arrival Month; y-axis: ADR [EUR]]

For the resort hotel, the price per night is highest in July, August and September, while for the city hotel the highest prices are in March, April and May.

Repeated guests, that is, guests who have booked before, are less likely to cancel their bookings.

## No Deposit    0.876464
## Non Refund    0.122179
## Refundable    0.001357
## Name: deposit_type, dtype: float64

[Figure: pie chart "Proportion of Total Bookings by Deposit Type"; No Deposit 87.6%, Non Refundable 12.2%, Refundable 0.1%]

The large majority of the bookings (87.6%) are made without a deposit.

## deposit_type  is_canceled
## No Deposit    0              0.716230
##               1              0.283770
## Non Refund    1              0.993624
##               0              0.006376
## Refundable    0              0.777778
##               1              0.222222
## Name: is_canceled, dtype: float64

[Figure: "Effect of Deposit Type on Cancellations"; x-axis: Deposit Type]

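Around 28% of the bookings with no deposit and 22% of the bookings with a refundable deposit were cancelled. By contrast, 99.4% of the bookings with a non-refundable deposit were cancelled. This is counterintuitive, since guests who cannot recover their deposit would be expected to cancel least, so the result deserves a closer look before any conclusion is drawn from it.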

## meal       is_canceled
## BB         0              0.626151
##            1              0.373849
## FB         1              0.598997
##            0              0.401003
## HB         0              0.655397
##            1              0.344603
## SC         0              0.627606
##            1              0.372394
## Undefined  0              0.755346
##            1              0.244654
## Name: is_canceled, dtype: float64

[Figure: "Effect of Meal type on Cancellations"; x-axis: Meal type]

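The bed and breakfast (BB) meal plan is the one most preferred by tourists, so it also has the highest number of cancellations. In relative terms, however, full board (FB) has the highest cancellation rate, about 60%, compared with about 37% for BB.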

## required_car_parking_spaces  is_canceled
## 0                            0              0.605051
##                              1              0.394949
## 1                            0              1.000000
## 2                            0              1.000000
## 3                            0              1.000000
## 8                            0              1.000000
## Name: is_canceled, dtype: float64

[Figure: "Effect of Car Parking Space on Cancellations"; x-axis: Number of Car Parking Spaces]

Almost 40% of the bookings made by guests who did not ask for a parking space were cancelled, whereas none of the bookings with at least one requested parking space were cancelled.

[Figure: "Arrival Year vs Lead Time By Cancellation Status"; x-axis: Arrival Year; y-axis: Lead Time]

Bookings with lead time less than 100 days are less likely to be cancelled.

EDA conclusions

Logistic regression

We will start by computing the correlations between each pair of numerical variables. These correlations are represented in a correlation matrix, which gives an idea of which variables change together.

##                                 is_canceled days_in_waiting_list
## is_canceled                     1.000000000          0.054185824
## days_in_waiting_list            0.054185824          1.000000000
## required_car_parking_spaces    -0.195497817         -0.030600046
## is_repeated_guest              -0.084793418         -0.022234965
## previous_bookings_not_canceled -0.057357723         -0.009396978
## booking_changes                -0.144380991         -0.011633945
## previous_cancellations          0.110132808          0.005928941
## lead_time                       0.293123356          0.170084184
## total_of_special_requests      -0.234657774         -0.082729719
## adr                             0.047556598         -0.040756412
## stays_in_week_nights            0.024764629         -0.002019810
## stays_in_weekend_nights        -0.001791078         -0.054151113
## adults                          0.060017213         -0.008283347
## children                        0.005036255         -0.033271416
## babies                         -0.032491089         -0.010620543
##                                required_car_parking_spaces is_repeated_guest
## is_canceled                                    -0.19549782      -0.084793418
## days_in_waiting_list                           -0.03060005      -0.022234965
## required_car_parking_spaces                     1.00000000       0.077089573
## is_repeated_guest                               0.07708957       1.000000000
## previous_bookings_not_canceled                  0.04765309       0.418055995
## booking_changes                                 0.06562019       0.012091787
## previous_cancellations                         -0.01849225       0.082293234
## lead_time                                      -0.11645057      -0.124409908
## total_of_special_requests                       0.08262634       0.013050009
## adr                                             0.05662809      -0.134314447
## stays_in_week_nights                           -0.02485942      -0.097244972
## stays_in_weekend_nights                        -0.01855381      -0.087239379
## adults                                          0.01478482      -0.146426116
## children                                        0.05625495      -0.032857741
## babies                                          0.03738336      -0.008942634
##                                previous_bookings_not_canceled booking_changes
## is_canceled                                      -0.057357723   -0.1443809911
## days_in_waiting_list                             -0.009396978   -0.0116339446
## required_car_parking_spaces                       0.047653087    0.0656201914
## is_repeated_guest                                 0.418055995    0.0120917873
## previous_bookings_not_canceled                    1.000000000    0.0116075289
## booking_changes                                   0.011607529    1.0000000000
## previous_cancellations                            0.152728115   -0.0269926626
## lead_time                                        -0.073548168    0.0001488301
## total_of_special_requests                         0.037823776    0.0528334357
## adr                                              -0.072144196    0.0196176738
## stays_in_week_nights                             -0.048742550    0.0962094460
## stays_in_weekend_nights                          -0.042715235    0.0632813159
## adults                                           -0.107983172   -0.0516727735
## children                                         -0.021071664    0.0489516990
## babies                                           -0.006550454    0.0834397814
##                                previous_cancellations     lead_time
## is_canceled                               0.110132808  0.2931233558
## days_in_waiting_list                      0.005928941  0.1700841843
## required_car_parking_spaces              -0.018492250 -0.1164505701
## is_repeated_guest                         0.082293234 -0.1244099080
## previous_bookings_not_canceled            0.152728115 -0.0735481679
## booking_changes                          -0.026992663  0.0001488301
## previous_cancellations                    1.000000000  0.0860418019
## lead_time                                 0.086041802  1.0000000000
## total_of_special_requests                -0.048384118 -0.0957120489
## adr                                      -0.065645638 -0.0630768525
## stays_in_week_nights                     -0.013992431  0.1657993639
## stays_in_weekend_nights                  -0.012774619  0.0856711329
## adults                                   -0.006738096  0.1195186926
## children                                 -0.024729166 -0.0376128161
## babies                                   -0.007500998 -0.0209150163
##                                total_of_special_requests         adr
## is_canceled                                  -0.23465777  0.04755660
## days_in_waiting_list                         -0.08272972 -0.04075641
## required_car_parking_spaces                   0.08262634  0.05662809
## is_repeated_guest                             0.01305001 -0.13431445
## previous_bookings_not_canceled                0.03782378 -0.07214420
## booking_changes                               0.05283344  0.01961767
## previous_cancellations                       -0.04838412 -0.06564564
## lead_time                                    -0.09571205 -0.06307685
## total_of_special_requests                     1.00000000  0.17218526
## adr                                           0.17218526  1.00000000
## stays_in_week_nights                          0.06819178  0.06523748
## stays_in_weekend_nights                       0.07267083  0.04934191
## adults                                        0.12288355  0.23064122
## children                                      0.08173584  0.32485303
## babies                                        0.09788879  0.02918569
##                                stays_in_week_nights stays_in_weekend_nights
## is_canceled                              0.02476463            -0.001791078
## days_in_waiting_list                    -0.00201981            -0.054151113
## required_car_parking_spaces             -0.02485942            -0.018553809
## is_repeated_guest                       -0.09724497            -0.087239379
## previous_bookings_not_canceled          -0.04874255            -0.042715235
## booking_changes                          0.09620945             0.063281316
## previous_cancellations                  -0.01399243            -0.012774619
## lead_time                                0.16579936             0.085671133
## total_of_special_requests                0.06819178             0.072670830
## adr                                      0.06523748             0.049341906
## stays_in_week_nights                     1.00000000             0.498968818
## stays_in_weekend_nights                  0.49896882             1.000000000
## adults                                   0.09297551             0.091871020
## children                                 0.04420335             0.045793885
## babies                                   0.02019097             0.018482810
##                                      adults     children       babies
## is_canceled                     0.060017213  0.005036255 -0.032491089
## days_in_waiting_list           -0.008283347 -0.033271416 -0.010620543
## required_car_parking_spaces     0.014784817  0.056254947  0.037383356
## is_repeated_guest              -0.146426116 -0.032857741 -0.008942634
## previous_bookings_not_canceled -0.107983172 -0.021071664 -0.006550454
## booking_changes                -0.051672774  0.048951699  0.083439781
## previous_cancellations         -0.006738096 -0.024729166 -0.007500998
## lead_time                       0.119518693 -0.037612816 -0.020915016
## total_of_special_requests       0.122883546  0.081735841  0.097888792
## adr                             0.230641216  0.324853030  0.029185690
## stays_in_week_nights            0.092975513  0.044203353  0.020190974
## stays_in_weekend_nights         0.091871020  0.045793885  0.018482810
## adults                          1.000000000  0.030440359  0.018145642
## children                        0.030440359  1.000000000  0.024030235
## babies                          0.018145642  0.024030235  1.000000000

The blue points suggest a positive correlation, while the red ones suggest a negative one. The bigger the point, the stronger the correlation.

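We can observe some correlations between variables, both negative and positive, but most of them are weak. The strongest correlations with is_canceled are those of lead_time (0.29), total_of_special_requests (-0.23) and required_car_parking_spaces (-0.20); the strongest correlation overall is between stays_in_week_nights and stays_in_weekend_nights (0.50).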
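
Each entry of the matrix above is a Pearson correlation coefficient. A minimal sketch of that computation, written out by hand, is given below. The two input lists are the six values of lead_time and booking_changes visible in the printed rows of the data set; six rows are far too few to say anything about the full data, so the result only illustrates the formula. The function name is an illustrative choice.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the sums of squared deviations.
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# First six rows of the data set, as printed in this report.
lead_time = [342, 737, 7, 13, 14, 14]
booking_changes = [3, 4, 0, 0, 0, 0]

r = pearson(lead_time, booking_changes)
print(f"r on six rows: {r:.3f}")
```

On the full data the same pair of variables has a correlation of about 0.0001, which shows how misleading a coefficient computed on a handful of rows can be.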

In order to build the logistic regression model we will use the glm function. The response is is_canceled and the predictors are a selection of the independent variables.

We specify that we want a logistic regression by setting the family argument to “binomial”.

## 
## Call:
## glm(formula = is_canceled ~ lead_time + customer_type + hotel + 
##     deposit_type + adr + total_of_special_requests, family = "binomial", 
##     data = mydataset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9767  -0.8087  -0.5849   0.2038   2.7645  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -1.8814713  0.0475111 -39.601  < 2e-16 ***
## lead_time                     0.0048474  0.0000781  62.066  < 2e-16 ***
## customer_typeGroup           -0.6239389  0.1473791  -4.234 2.30e-05 ***
## customer_typeTransient        0.6274113  0.0445352  14.088  < 2e-16 ***
## customer_typeTransient-Party -0.2813941  0.0466103  -6.037 1.57e-09 ***
## hotelResort Hotel            -0.3099591  0.0152919 -20.270  < 2e-16 ***
## deposit_typeNon Refund        5.1679826  0.1047350  49.343  < 2e-16 ***
## deposit_typeRefundable       -0.0388890  0.1961689  -0.198    0.843    
## adr                           0.0048739  0.0001476  33.031  < 2e-16 ***
## total_of_special_requests    -0.5470358  0.0101321 -53.990  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 157398  on 119389  degrees of freedom
## Residual deviance: 117225  on 119380  degrees of freedom
## AIC: 117245
## 
## Number of Fisher Scoring iterations: 7

The summary returns the estimates, standard errors, z-values and p-values for each of the coefficients. According to the results, all coefficients are statistically significant, with one exception: deposit_typeRefundable (p = 0.843).

The next step is to make predictions.

## [1] 0.5234317 0.8816891 0.2378428 0.2431552 0.1728522 0.1728522

The result is an array of probabilities. Of the six shown, the first two are the largest, approximately 52% and 88%. Next, we will predict whether each booking will be canceled, based on the variables chosen as predictors, and compute the accuracy of the model. To do this, we use the ifelse command with a mean as the threshold; its value, 0.3704, is equal to the overall cancellation rate.
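
The predicted probabilities can be reproduced by hand from the coefficient table, which helps to see what the model does. The sketch below assumes, based on the printed structure of the data set, that the first rows are Transient, Resort Hotel, No Deposit bookings; Contract, City Hotel and No Deposit are the baseline levels and add nothing to the sum. The threshold is the mean reported in the output, and the function name is an illustrative choice.

```python
from math import exp

# Coefficients copied from the glm summary in this report.
INTERCEPT = -1.8814713
B_LEAD_TIME = 0.0048474
B_TRANSIENT = 0.6274113      # customer_typeTransient
B_RESORT = -0.3099591        # hotelResort Hotel
B_ADR = 0.0048739
B_REQUESTS = -0.5470358      # total_of_special_requests


def predict_probability(lead_time, adr, special_requests):
    """Cancellation probability for a Transient, Resort Hotel, No Deposit booking."""
    logit = (
        INTERCEPT
        + B_LEAD_TIME * lead_time
        + B_TRANSIENT
        + B_RESORT
        + B_ADR * adr
        + B_REQUESTS * special_requests
    )
    # The logistic (sigmoid) function maps the linear predictor to (0, 1).
    return 1.0 / (1.0 + exp(-logit))


# (lead_time, adr, total_of_special_requests) for the first three rows.
rows = [(342, 0, 0), (737, 0, 0), (7, 75, 0)]
probabilities = [predict_probability(*row) for row in rows]

# Classify with the mean used as threshold, mirroring ifelse(p > threshold, 1, 0).
threshold = 0.3704163
predictions = [1 if p > threshold else 0 for p in probabilities]

print([round(p, 4) for p in probabilities])
print(predictions)
```

Up to rounding of the printed coefficients, the three values agree with the first three probabilities in the output (0.5234, 0.8817, 0.2378).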

## [1] 0.3704163
Confusion matrix

TP = true positive, TN = true negative, FP = false positive, FN = false negative

##    
##         0     1
##   0 62233 17077
##   1 12933 27147
## [1] 0.7486389

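The accuracy of the model is almost 75%. In the confusion matrix the rows are the predicted classes and the columns are the actual ones (the column totals, 75,166 and 44,224, match the actual numbers of non-cancelled and cancelled bookings). The model correctly predicted that 62,233 bookings would not be canceled (TN), but missed 17,077 bookings that were in fact canceled (FN). Likewise, it wrongly flagged 12,933 bookings as canceled (FP) and correctly classified 27,147 bookings as canceled (TP).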
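
The accuracy and a few related measures follow directly from the four cells of the confusion matrix. The sketch below assumes that rows are predictions and columns are actual classes, which is consistent with the column totals; the variable names are illustrative.

```python
# Cells of the confusion matrix (rows: predicted, columns: actual).
tn, fn = 62233, 17077   # predicted 0: actual 0, actual 1
fp, tp = 12933, 27147   # predicted 1: actual 0, actual 1

total = tn + fn + fp + tp

accuracy = (tp + tn) / total     # share of all bookings classified correctly
sensitivity = tp / (tp + fn)     # share of actual cancellations that were caught
specificity = tn / (tn + fp)     # share of kept bookings recognised as such
precision = tp / (tp + fp)       # share of predicted cancellations that were real

print(f"accuracy    = {accuracy:.4f}")
print(f"sensitivity = {sensitivity:.4f}")
print(f"specificity = {specificity:.4f}")
print(f"precision   = {precision:.4f}")
```

Accuracy alone hides the fact that the model recognises kept bookings (specificity of about 0.83) better than it catches cancellations (sensitivity of about 0.61).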

As a last step, we will analyse the ROC curve and the AUC, which are used to assess the performance of a classification model. ROC stands for receiver operating characteristic and AUC for area under the curve.

AUC can tell us whether the model is able to distinguish between the classes. The higher it is, the better the model will predict if the result will be 1 or 0.

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

The AUC is the area under the ROC curve. Its computed value is:

## Area under the curve: 0.7209

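An AUC of 0.72 is clearly better than the 0.5 of random guessing, but it is not close to 1, so the model has a fair, rather than strong, ability to distinguish cancelled from non-cancelled bookings. The value equals the average of the sensitivity and specificity implied by the confusion matrix, which suggests that the ROC curve was computed from the thresholded 0/1 predictions rather than from the predicted probabilities.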
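
To see where a value such as 0.7209 can come from, the sketch below computes the area under a ROC curve that has a single operating point, namely the one given by the confusion matrix in this report. This is an illustration under the assumption that the curve was built from 0/1 predictions; it does not claim to reproduce the exact pROC call. The function name is hypothetical.

```python
def trapezoid_auc(points):
    """Area under a piecewise-linear ROC curve given (FPR, TPR) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Trapezoid between two consecutive points of the curve.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Rates implied by the confusion matrix (rows: predicted, columns: actual).
tpr = 27147 / (27147 + 17077)   # true positive rate (sensitivity)
fpr = 12933 / (12933 + 62233)   # false positive rate (1 - specificity)

# A classifier with hard 0/1 outputs yields only one interior ROC point.
auc = trapezoid_auc([(0.0, 0.0), (fpr, tpr), (1.0, 1.0)])
print(f"AUC with one operating point: {auc:.4f}")
```

With a single interior point the area reduces to (TPR + TNR) / 2. Passing the predicted probabilities instead of the 0/1 labels would trace the full curve and would typically give a different AUC.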

Random forest

We will use the PCA algorithm to reduce the dimension of our data.

##   lead_time arrival_date_year arrival_date_week_number
## 1       342              2015                       27
## 2       737              2015                       27
## 3         7              2015                       27
## 4        13              2015                       27
## 5        14              2015                       27
## 6        14              2015                       27
##   arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults
## 1                         1                       0                    0      2
## 2                         1                       0                    0      2
## 3                         1                       0                    1      1
## 4                         1                       0                    1      1
## 5                         1                       0                    2      2
## 6                         1                       0                    2      2
##   children babies is_repeated_guest previous_cancellations
## 1        0      0                 0                      0
## 2        0      0                 0                      0
## 3        0      0                 0                      0
## 4        0      0                 0                      0
## 5        0      0                 0                      0
## 6        0      0                 0                      0
##   previous_bookings_not_canceled booking_changes days_in_waiting_list adr
## 1                              0               3                    0   0
## 2                              0               4                    0   0
## 3                              0               0                    0  75
## 4                              0               0                    0  75
## 5                              0               0                    0  98
## 6                              0               0                    0  98
##   required_car_parking_spaces total_of_special_requests   predict predictions
## 1                           0                         0 0.5234317           1
## 2                           0                         0 0.8816891           1
## 3                           0                         0 0.2378428           0
## 4                           0                         0 0.2431552           0
## 5                           0                         1 0.1728522           0
## 6                           0                         1 0.1728522           0
## Importance of components:
##                           PC1    PC2     PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.6172 1.3813 1.26126 1.1874 1.15730 1.04284 1.01679
## Proportion of Variance 0.1376 0.1004 0.08372 0.0742 0.07049 0.05724 0.05441
## Cumulative Proportion  0.1376 0.2381 0.32180 0.3960 0.46649 0.52373 0.57815
##                            PC8     PC9    PC10   PC11    PC12   PC13    PC14
## Standard deviation     1.00615 0.97687 0.96751 0.9399 0.93127 0.8707 0.81504
## Proportion of Variance 0.05328 0.05022 0.04927 0.0465 0.04565 0.0399 0.03496
## Cumulative Proportion  0.63143 0.68165 0.73092 0.7774 0.82306 0.8630 0.89793
##                           PC15   PC16    PC17    PC18    PC19
## Standard deviation     0.75140 0.7029 0.61136 0.54979 0.45251
## Proportion of Variance 0.02972 0.0260 0.01967 0.01591 0.01078
## Cumulative Proportion  0.92764 0.9536 0.97331 0.98922 1.00000

PC1, PC2 and PC3 have the largest proportions of variance (13.76%, 10.04% and 8.37%), so we keep these three components. Together they explain about 32.2% of the total variance.
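The table above can be reproduced in outline with `prcomp`. The following is a minimal sketch on hypothetical toy data (the data frame `toy` and its values are made up for illustration, not taken from the booking data). It assumes the inputs are centred and scaled, which is consistent with the reported output: with 19 scaled variables the total variance is 19, and 1.6172^2 / 19 is approximately 0.1376.

```r
# Hypothetical toy data standing in for the numeric booking columns
set.seed(1)
toy <- data.frame(
  lead_time            = rpois(200, 80),
  adults               = sample(1:3, 200, replace = TRUE),
  adr                  = rnorm(200, mean = 100, sd = 30),
  stays_in_week_nights = rpois(200, 3)
)

# Centre and scale so that variables on large scales do not dominate
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of variance = squared standard deviation / total variance
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cum_var  <- cumsum(prop_var)

# Keep the scores of the first k components as new features
k <- 3
scores <- as.data.frame(pca$x[, 1:k])
head(scores)
```

`summary(pca)` prints the same three rows (standard deviation, proportion of variance, cumulative proportion) as the output shown in this section.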

##         PC1        PC2        PC3        PC4         PC5        PC6        PC7
## 1 -1.715752 -1.6041982 -0.9598965  0.5578969 0.369869042  2.5424196  0.5917407
## 2 -3.968765 -1.4271937 -1.1697921  1.4656700 0.004241446  3.7990130  1.4356120
## 3  1.057518 -1.9002481 -0.7015852 -0.7152206 0.911370494 -0.1746979 -0.8227414
## 4  1.021634 -1.9003218 -0.7037807 -0.7054257 0.907879305 -0.1702983 -0.8091639
## 5  1.298097 -0.5787277 -0.8997913 -0.8363931 0.429996163 -1.0175978  0.3036304
## 6  1.298097 -0.5787277 -0.8997913 -0.8363931 0.429996163 -1.0175978  0.3036304
##         PC8        PC9        PC10       PC11       PC12        PC13
## 1 -2.322707 -0.5698578  0.01176266 -2.0810442 -3.0386984  0.33153092
## 2 -2.385715 -0.6223914  0.13305467 -2.9581197 -4.5118811 -0.98868782
## 3 -1.680015 -0.5546796 -0.79908238  0.3396690  0.1298267 -0.04607736
## 4 -1.676955 -0.5555551 -0.79701853  0.3379963  0.1177325 -0.07156416
## 5 -1.309910 -0.7148110 -0.63236473  0.1428980 -0.3949481  0.27986457
## 6 -1.309910 -0.7148110 -0.63236473  0.1428980 -0.3949481  0.27986457
##          PC14      PC15       PC16      PC17       PC18        PC19
## 1 -1.80919683 0.5568728 -0.6000674 0.7629200 -0.3368832 -0.79694361
## 2 -2.73562692 0.6690934 -0.8049877 0.4787150 -2.1372014 -0.06477837
## 3  0.33140496 0.2452882  0.2160574 0.4865680 -0.4854662 -0.16119232
## 4  0.31731314 0.2470322  0.2141921 0.4803561 -0.5137804 -0.15123404
## 5 -0.02012024 0.1633696  0.5562939 0.8581534 -0.2581641 -0.21462075
## 6 -0.02012024 0.1633696  0.5562939 0.8581534 -0.2581641 -0.21462075
##         PC1        PC2        PC3
## 1 -1.715752 -1.6041982 -0.9598965
## 2 -3.968765 -1.4271937 -1.1697921
## 3  1.057518 -1.9002481 -0.7015852
## 4  1.021634 -1.9003218 -0.7037807
## 5  1.298097 -0.5787277 -0.8997913
## 6  1.298097 -0.5787277 -0.8997913
##          hotel arrival_date_month meal country market_segment
## 1 Resort Hotel               July   BB     PRT         Direct
## 2 Resort Hotel               July   BB     PRT         Direct
## 3 Resort Hotel               July   BB     GBR         Direct
## 4 Resort Hotel               July   BB     GBR      Corporate
## 5 Resort Hotel               July   BB     GBR      Online TA
## 6 Resort Hotel               July   BB     GBR      Online TA
##   distribution_channel reserved_room_type assigned_room_type deposit_type agent
## 1               Direct                  C                  C   No Deposit  NULL
## 2               Direct                  C                  C   No Deposit  NULL
## 3               Direct                  A                  C   No Deposit  NULL
## 4            Corporate                  A                  A   No Deposit   304
## 5                TA/TO                  A                  A   No Deposit   240
## 6                TA/TO                  A                  A   No Deposit   240
##   company customer_type reservation_status reservation_status_date
## 1    NULL     Transient          Check-Out              2015-07-01
## 2    NULL     Transient          Check-Out              2015-07-01
## 3    NULL     Transient          Check-Out              2015-07-02
## 4    NULL     Transient          Check-Out              2015-07-02
## 5    NULL     Transient          Check-Out              2015-07-03
## 6    NULL     Transient          Check-Out              2015-07-03

Since the dataset is imbalanced, we use ensemble methods, which help to limit overfitting. We do not need 'reservation_status_date', 'company' (94% missing values) or 'agent'. The randomForest function does not accept categorical variables with more than 53 levels. Since the 'country' feature has 178 levels, one-hot encoding all of its values would lead to the curse of dimensionality.
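To illustrate why one-hot encoding inflates the number of columns, here is a minimal sketch in base R on a hypothetical toy data frame (`toy` and its values are assumptions for illustration). It uses `model.matrix`, which may differ from the encoder used to produce the output further down.

```r
# Hypothetical toy data standing in for the categorical booking columns
toy <- data.frame(
  country = factor(c("PRT", "PRT", "GBR", "GBR", "ESP", "FRA")),
  hotel   = factor(c("Resort Hotel", "Resort Hotel", "City Hotel",
                     "City Hotel", "Resort Hotel", "City Hotel"))
)

# Number of levels per factor; factors with many levels are the ones to watch
n_levels <- sapply(toy, nlevels)
n_levels

# `~ country - 1` drops the intercept so that every level gets its own 0/1 column
dummies <- model.matrix(~ country - 1, data = toy)
colnames(dummies)
dim(dummies)
```

Each level becomes one column, so a factor with 178 levels yields 178 mostly-zero columns.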

We combine the response variable (is_canceled) and the categorical variables with the principal components created above.

##   is_canceled        hotel arrival_date_month meal country market_segment
## 1           0 Resort Hotel               July   BB     PRT         Direct
## 2           0 Resort Hotel               July   BB     PRT         Direct
## 3           0 Resort Hotel               July   BB     GBR         Direct
## 4           0 Resort Hotel               July   BB     GBR      Corporate
## 5           0 Resort Hotel               July   BB     GBR      Online TA
## 6           0 Resort Hotel               July   BB     GBR      Online TA
##   distribution_channel reserved_room_type assigned_room_type deposit_type agent
## 1               Direct                  C                  C   No Deposit  NULL
## 2               Direct                  C                  C   No Deposit  NULL
## 3               Direct                  A                  C   No Deposit  NULL
## 4            Corporate                  A                  A   No Deposit   304
## 5                TA/TO                  A                  A   No Deposit   240
## 6                TA/TO                  A                  A   No Deposit   240
##   company customer_type reservation_status reservation_status_date       PC1
## 1    NULL     Transient          Check-Out              2015-07-01 -1.715752
## 2    NULL     Transient          Check-Out              2015-07-01 -3.968765
## 3    NULL     Transient          Check-Out              2015-07-02  1.057518
## 4    NULL     Transient          Check-Out              2015-07-02  1.021634
## 5    NULL     Transient          Check-Out              2015-07-03  1.298097
## 6    NULL     Transient          Check-Out              2015-07-03  1.298097
##          PC2        PC3
## 1 -1.6041982 -0.9598965
## 2 -1.4271937 -1.1697921
## 3 -1.9002481 -0.7015852
## 4 -1.9003218 -0.7037807
## 5 -0.5787277 -0.8997913
## 6 -0.5787277 -0.8997913
## 'data.frame':    119390 obs. of  18 variables:
##  $ is_canceled            : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ hotel                  : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
##  $ arrival_date_month     : Factor w/ 12 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ meal                   : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
##  $ country                : Factor w/ 178 levels "ABW","AGO","AIA",..: 137 137 60 60 60 60 137 137 137 137 ...
##  $ market_segment         : Factor w/ 8 levels "Aviation","Complementary",..: 4 4 4 3 7 7 4 4 7 6 ...
##  $ distribution_channel   : Factor w/ 5 levels "Corporate","Direct",..: 2 2 2 1 4 4 2 2 4 4 ...
##  $ reserved_room_type     : Factor w/ 10 levels "A","B","C","D",..: 3 3 1 1 1 1 3 3 1 4 ...
##  $ assigned_room_type     : Factor w/ 12 levels "A","B","C","D",..: 3 3 3 1 1 1 3 3 1 4 ...
##  $ deposit_type           : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ agent                  : Factor w/ 334 levels "1","10","103",..: 334 334 334 157 103 103 334 156 103 40 ...
##  $ company                : Factor w/ 353 levels "10","100","101",..: 353 353 353 353 353 353 353 353 353 353 ...
##  $ customer_type          : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ reservation_status     : Factor w/ 3 levels "Canceled","Check-Out",..: 2 2 2 2 2 2 2 2 1 1 ...
##  $ reservation_status_date: Factor w/ 926 levels "2014-10-17","2014-11-18",..: 122 122 123 123 124 124 124 124 73 62 ...
##  $ PC1                    : num  -1.72 -3.97 1.06 1.02 1.3 ...
##  $ PC2                    : num  -1.604 -1.427 -1.9 -1.9 -0.579 ...
##  $ PC3                    : num  -0.96 -1.17 -0.702 -0.704 -0.9 ...
##   is_canceled        hotel arrival_date_month meal country market_segment
## 1           0 Resort Hotel               July   BB     PRT         Direct
## 2           0 Resort Hotel               July   BB     PRT         Direct
## 3           0 Resort Hotel               July   BB     GBR         Direct
## 4           0 Resort Hotel               July   BB     GBR      Corporate
## 5           0 Resort Hotel               July   BB     GBR      Online TA
## 6           0 Resort Hotel               July   BB     GBR      Online TA
##   distribution_channel reserved_room_type assigned_room_type deposit_type
## 1               Direct                  C                  C   No Deposit
## 2               Direct                  C                  C   No Deposit
## 3               Direct                  A                  C   No Deposit
## 4            Corporate                  A                  A   No Deposit
## 5                TA/TO                  A                  A   No Deposit
## 6                TA/TO                  A                  A   No Deposit
##   customer_type reservation_status       PC1        PC2        PC3
## 1     Transient          Check-Out -1.715752 -1.6041982 -0.9598965
## 2     Transient          Check-Out -3.968765 -1.4271937 -1.1697921
## 3     Transient          Check-Out  1.057518 -1.9002481 -0.7015852
## 4     Transient          Check-Out  1.021634 -1.9003218 -0.7037807
## 5     Transient          Check-Out  1.298097 -0.5787277 -0.8997913
## 6     Transient          Check-Out  1.298097 -0.5787277 -0.8997913
##   country.ABW country.AGO country.AIA country.ALB country.AND country.ARE
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.ARG country.ARM country.ASM country.ATA country.ATF country.AUS
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.AUT country.AZE country.BDI country.BEL country.BEN country.BFA
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.BGD country.BGR country.BHR country.BHS country.BIH country.BLR
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.BOL country.BRA country.BRB country.BWA country.CAF country.CHE
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.CHL country.CHN country.CIV country.CMR country.CN country.COL
## 1           0           0           0           0          0           0
## 2           0           0           0           0          0           0
## 3           0           0           0           0          0           0
## 4           0           0           0           0          0           0
## 5           0           0           0           0          0           0
## 6           0           0           0           0          0           0
##   country.COM country.CPV country.CRI country.CUB country.CYM country.CYP
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.CZE country.DEU country.DJI country.DMA country.DNK country.DOM
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.DZA country.ECU country.EGY country.ESP country.EST country.ETH
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.FIN country.FJI country.FRA country.FRO country.GAB country.GBR
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           1
## 4           0           0           0           0           0           1
## 5           0           0           0           0           0           1
## 6           0           0           0           0           0           1
##   country.GEO country.GGY country.GHA country.GIB country.GLP country.GNB
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.GRC country.GTM country.GUY country.HKG country.HND country.HRV
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.HUN country.IDN country.IMN country.IND country.IRL country.IRN
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.IRQ country.ISL country.ISR country.ITA country.JAM country.JEY
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.JOR country.JPN country.KAZ country.KEN country.KHM country.KIR
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.KNA country.KOR country.KWT country.LAO country.LBN country.LBY
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.LCA country.LIE country.LKA country.LTU country.LUX country.LVA
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.MAC country.MAR country.MCO country.MDG country.MDV country.MEX
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.MKD country.MLI country.MLT country.MMR country.MNE country.MOZ
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.MRT country.MUS country.MWI country.MYS country.MYT country.NAM
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.NCL country.NGA country.NIC country.NLD country.NOR country.NPL
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.NULL country.NZL country.OMN country.PAK country.PAN country.PER
## 1            0           0           0           0           0           0
## 2            0           0           0           0           0           0
## 3            0           0           0           0           0           0
## 4            0           0           0           0           0           0
## 5            0           0           0           0           0           0
## 6            0           0           0           0           0           0
##   country.PHL country.PLW country.POL country.PRI country.PRT country.PRY
## 1           0           0           0           0           1           0
## 2           0           0           0           0           1           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.PYF country.QAT country.ROU country.RUS country.RWA country.SAU
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.SDN country.SEN country.SGP country.SLE country.SLV country.SMR
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.SRB country.STP country.SUR country.SVK country.SVN country.SWE
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.SYC country.SYR country.TGO country.THA country.TJK country.TMP
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.TUN country.TUR country.TWN country.TZA country.UGA country.UKR
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.UMI country.URY country.USA country.UZB country.VEN country.VGB
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   country.VNM country.ZAF country.ZMB country.ZWE
## 1           0           0           0           0
## 2           0           0           0           0
## 3           0           0           0           0
## 4           0           0           0           0
## 5           0           0           0           0
## 6           0           0           0           0

We will apply PCA to the dummy variables because there are too many columns.
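A minimal sketch of this step, on hypothetical toy data: the dummy columns are compressed into two principal components, which then replace the original factor. The names `toy`, `country_PC1` and `country_PC2` are made up for illustration, and scaling the dummy columns before PCA is an assumption, since the original settings are not shown.

```r
# Hypothetical toy data: one factor plus two existing numeric components
set.seed(2)
toy <- data.frame(
  country = factor(sample(c("PRT", "GBR", "ESP", "FRA", "DEU"), 300,
                          replace = TRUE,
                          prob = c(0.40, 0.20, 0.15, 0.15, 0.10))),
  PC1 = rnorm(300),
  PC2 = rnorm(300)
)

# One 0/1 column per country level
dummies <- model.matrix(~ country - 1, data = toy)

# PCA on the dummy columns (scaling is an assumption)
dummy_pca <- prcomp(dummies, center = TRUE, scale. = TRUE)

# Keep the first two components and give them distinct names
country_pcs <- dummy_pca$x[, 1:2]
colnames(country_pcs) <- c("country_PC1", "country_PC2")

# Replace the factor with its two components
reduced <- cbind(toy[, setdiff(names(toy), "country")], country_pcs)
str(reduced)
```

Naming the new columns explicitly avoids automatically de-duplicated names such as `PC1.1` and `PC2.1`.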

##         PC1        PC2
## 1 -1.440801 0.03290349
## 2 -1.440801 0.03290349
## 3  1.150869 2.73906746
## 4  1.150869 2.73906746
## 5  1.150869 2.73906746
## 6  1.150869 2.73906746
## 'data.frame':    119390 obs. of  16 variables:
##  $ is_canceled         : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ hotel               : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
##  $ arrival_date_month  : Factor w/ 12 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ meal                : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
##  $ market_segment      : Factor w/ 8 levels "Aviation","Complementary",..: 4 4 4 3 7 7 4 4 7 6 ...
##  $ distribution_channel: Factor w/ 5 levels "Corporate","Direct",..: 2 2 2 1 4 4 2 2 4 4 ...
##  $ reserved_room_type  : Factor w/ 10 levels "A","B","C","D",..: 3 3 1 1 1 1 3 3 1 4 ...
##  $ assigned_room_type  : Factor w/ 12 levels "A","B","C","D",..: 3 3 3 1 1 1 3 3 1 4 ...
##  $ deposit_type        : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ customer_type       : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ reservation_status  : Factor w/ 3 levels "Canceled","Check-Out",..: 2 2 2 2 2 2 2 2 1 1 ...
##  $ PC1                 : num  -1.72 -3.97 1.06 1.02 1.3 ...
##  $ PC2                 : num  -1.604 -1.427 -1.9 -1.9 -0.579 ...
##  $ PC3                 : num  -0.96 -1.17 -0.702 -0.704 -0.9 ...
##  $ PC1.1               : num  -1.44 -1.44 1.15 1.15 1.15 ...
##  $ PC2.1               : num  0.0329 0.0329 2.7391 2.7391 2.7391 ...
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 52680     0
##          1     0 30880
##                                    
##                Accuracy : 1        
##                  95% CI : (1, 1)   
##     No Information Rate : 0.6304   
##     P-Value [Acc > NIR] : < 2.2e-16
##                                    
##                   Kappa : 1        
##                                    
##  Mcnemar's Test P-Value : NA       
##                                    
##             Sensitivity : 1.0000   
##             Specificity : 1.0000   
##          Pos Pred Value : 1.0000   
##          Neg Pred Value : 1.0000   
##              Prevalence : 0.6304   
##          Detection Rate : 0.6304   
##    Detection Prevalence : 0.6304   
##       Balanced Accuracy : 1.0000   
##                                    
##        'Positive' Class : 0        
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 22486     0
##          1     0 13344
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9999, 1)
##     No Information Rate : 0.6276     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.6276     
##          Detection Rate : 0.6276     
##    Detection Prevalence : 0.6276     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
## 
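
The summary statistics in the two matrices above follow directly from the four cell counts. As a minimal sketch in base R (the helper name `cm_stats` is hypothetical and not part of the original analysis), the accuracy and the No Information Rate can be recomputed from the reported counts:

```r
# Recompute accuracy and the No Information Rate (NIR) from confusion-matrix
# counts. cm_stats is a hypothetical helper, not part of the original analysis.
cm_stats <- function(counts) {
  # counts: 2x2 matrix, rows = prediction, columns = reference
  total    <- sum(counts)
  accuracy <- sum(diag(counts)) / total
  # NIR: share of the most frequent reference class, i.e. the accuracy
  # obtained by always predicting the majority class
  nir      <- max(colSums(counts)) / total
  c(n = total, accuracy = accuracy, nir = nir)
}

labels <- list(Prediction = c("0", "1"), Reference = c("0", "1"))

# Counts copied from the two matrices above
first  <- matrix(c(52680, 0, 0, 30880), nrow = 2, byrow = TRUE, dimnames = labels)
second <- matrix(c(22486, 0, 0, 13344), nrow = 2, byrow = TRUE, dimnames = labels)

stats_first  <- cm_stats(first)
stats_second <- cm_stats(second)

print(round(stats_first, 4))
print(round(stats_second, 4))
```

The two matrices cover 83,560 and 35,830 bookings, which together account for all 119,390 observations. An accuracy of exactly 1 on both is unusual for real booking data. One possible explanation, which the output alone cannot confirm, is that a predictor such as `reservation_status` encodes the outcome directly.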

We drop the country variable and refit the model. The resulting confusion matrix is identical to the previous one.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 22486     0
##          1     0 13344
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9999, 1)
##     No Information Rate : 0.6276     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.6276     
##          Detection Rate : 0.6276     
##    Detection Prevalence : 0.6276     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
## 

Time series

## 
##  Augmented Dickey-Fuller Test
## 
## data:  hs
## Dickey-Fuller = -3.2251, Lag order = 2, p-value = 0.1057
## alternative hypothesis: stationary
## 
##  KPSS Test for Level Stationarity
## 
## data:  hs
## KPSS Level = 0.42882, Truncation lag parameter = 2, p-value = 0.06473
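
The two tests above have opposite null hypotheses: the ADF test assumes a unit root (non-stationarity), while the KPSS test assumes level stationarity. With p = 0.1057 the ADF test does not reject non-stationarity at the 5% level, and with p = 0.06473 the KPSS test does not reject stationarity at the 5% level either, so on a series this short the evidence is inconclusive. A minimal sketch of running both tests with `tseries` follows. The simulated 26-month series is an assumption that only stands in for `hs`; it is not the hotel data.

```r
library(tseries)  # provides adf.test() and kpss.test()

set.seed(1)
# Simulated monthly series from July 2015; stands in for `hs`, NOT the hotel data
months <- 1:26
sim <- ts(3000 + 300 * sin(2 * pi * months / 12) + rnorm(26, sd = 150),
          start = c(2015, 7), frequency = 12)

# Both functions warn when the p-value falls outside their lookup tables,
# so the warnings are silenced for this illustration
adf  <- suppressWarnings(adf.test(sim))   # H0: the series has a unit root
kpss <- suppressWarnings(kpss.test(sim))  # H0: the series is level stationary

cat("ADF  p-value:", adf$p.value, "\n")
cat("KPSS p-value:", kpss$p.value, "\n")
```

A small ADF p-value together with a large KPSS p-value would point to stationarity; the opposite pattern would point to a unit root.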

## 
##  ARIMA(2,1,2)(0,1,0)[12]                    : Inf
##  ARIMA(0,1,0)(0,1,0)[12]                    : 195.3849
##  ARIMA(1,1,0)(0,1,0)[12]                    : 198.1213
##  ARIMA(0,1,1)(0,1,0)[12]                    : 198.1003
##  ARIMA(1,1,1)(0,1,0)[12]                    : 201.5643
## 
##  Best model: ARIMA(0,1,0)(0,1,0)[12]
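
The trace above lists one information-criterion value per candidate model (AICc is the default criterion in `auto.arima`), and the candidate with the lowest value, ARIMA(0,1,0)(0,1,0)[12] at 195.3849, is selected. As a rough sketch of the same comparison, the candidates can be fitted by hand with base R's `arima()` and ranked by AIC. The series is simulated and the helper `fit_aic` is hypothetical. AIC is used here in place of AICc, so the values would not match the trace even on the real data.

```r
set.seed(42)
# Simulated 26-month series, NOT the hotel data
months <- 1:26
sim <- ts(3000 + 400 * sin(2 * pi * months / 12) + cumsum(rnorm(26, sd = 100)),
          start = c(2015, 7), frequency = 12)

# Non-seasonal (p, d, q) candidates; the seasonal part is fixed at (0,1,0)[12]
candidates <- list("(0,1,0)" = c(0, 1, 0),
                   "(1,1,0)" = c(1, 1, 0),
                   "(0,1,1)" = c(0, 1, 1),
                   "(1,1,1)" = c(1, 1, 1))

# Fit one candidate and return its AIC, or NA if the fit fails
fit_aic <- function(ord) {
  tryCatch(
    AIC(arima(sim, order = ord,
              seasonal = list(order = c(0, 1, 0), period = 12),
              method = "ML")),
    error = function(e) NA_real_
  )
}

aic_values <- sapply(candidates, fit_aic)
print(round(aic_values, 2))

# which.min() skips NA entries, so failed fits are ignored
best <- names(which.min(aic_values))
cat("Lowest AIC:", best, "\n")
```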

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,0)(0,1,0)[12]
## Q* = 4.6398, df = 5, p-value = 0.4614
## 
## Model df: 0.   Total lags used: 5
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -72.98426 286.6851 164.4508 -2.558678 5.461286 0.3415385
##                    ACF1
## Training set 0.02412717
##          Point Forecast      Lo 80    Hi 80      Lo 95    Hi 95
## Sep 2017           3205  2685.4155 3724.585  2410.3641 3999.636
## Oct 2017           3551  2816.1965 4285.803  2427.2151 4674.785
## Nov 2017           2909  2009.0532 3808.947  1532.6502 4285.350
## Dec 2017           2090  1050.8310 3129.169   500.7281 3679.272
## Jan 2018           2492  1330.1737 3653.826   715.1400 4268.860
## Feb 2018           2562  1289.2831 3834.717   615.5474 4508.453
## Mar 2018           3073  1698.3086 4447.691   970.5909 5175.409
## Apr 2018           3039  1569.3931 4508.607   791.4302 5286.570
## May 2018           3430  1871.2465 4988.754  1046.0922 5813.908
## Jun 2018           3055  1411.9295 4698.070   542.1405 5567.859
## Jul 2018           3193  1469.7331 4916.267   557.4908 5828.509
## Aug 2018           2954  1154.1065 4753.894   201.3004 5706.700
## Sep 2018           3062   983.6620 5140.338  -116.5437 6240.544
## Oct 2018           3408  1084.3474 5731.653  -145.7199 6961.720
## Nov 2018           2766   220.5661 5311.434 -1126.9052 6658.905
## Dec 2018           1947  -802.3828 4696.383 -2257.8181 6151.818
## Jan 2019           2349  -590.2139 5288.214 -2146.1397 6844.140
## Feb 2019           2419  -698.5071 5536.507 -2348.8156 7186.816
## Mar 2019           2930  -356.1410 6216.141 -2095.7189 7955.719
## Apr 2019           2896  -550.5337 6342.534 -2375.0185 8167.018
## May 2019           3287  -312.7871 6886.787 -2218.3993 8792.399
## Jun 2019           2912  -834.7772 6658.777 -2818.2012 8642.201
## Jul 2019           3050  -838.2145 6938.214 -2896.5108 8996.511
## Aug 2019           2811 -1213.6843 6835.684 -3344.2235 8966.224
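
For an ARIMA(0,1,0)(0,1,0)[12] model, which has no estimated coefficients, the forecast error variance grows linearly with the horizon over the first seasonal cycle, so the interval half-width at horizon h is the one-step half-width times sqrt(h). The following base R sketch checks this against rows of the table above. The numbers are copied from the output and the variable names are illustrative.

```r
# Values copied from the forecast table above
h     <- c(1, 2, 3, 12)                       # Sep, Oct, Nov 2017 and Aug 2018
point <- c(3205, 3551, 2909, 2954)            # point forecasts
hi80  <- c(3724.585, 4285.803, 3808.947, 4753.894)  # upper 80% bounds

half_width <- hi80 - point

# One-step standard deviation implied by the first 80% interval
sigma <- half_width[1] / qnorm(0.9)

# Expected half-widths if the variance grows as h * sigma^2
expected <- qnorm(0.9) * sigma * sqrt(h)

print(round(cbind(h, half_width, expected), 3))

# Year-over-year change in the point forecasts
# (Sep, Oct, Nov 2018 against Sep, Oct, Nov 2017)
yoy <- c(3062, 3408, 2766) - c(3205, 3551, 2909)
print(yoy)
```

The point forecasts repeat the seasonal pattern with a constant shift: each month in the second forecast year is 143 below the same month in the first. The 95% lower bounds become negative from September 2018 onward, which cannot occur if the series counts reservations, so the long-horizon intervals mainly show how quickly uncertainty accumulates with this model.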

##              xhat    level      trend     season
## Jul 2016 2728.014 2725.259   38.14394  -35.38889
## Aug 2016 3151.058 2895.019  169.76059   86.27778
## Sep 2016 3483.167 3098.038  203.01837  182.11111
## Oct 2016 3937.059 3250.514  152.47585  534.06944
## Nov 2016 3273.471 3312.104   61.58986 -100.22222
## Dec 2016 2350.017 3290.880  -21.22389 -919.63889
## Jan 2017 2416.407 3225.900  -64.97949 -744.51389
## Feb 2017 2849.278 3242.658   16.75814 -410.13889
## Mar 2017 3459.721 3205.467  -37.19093  291.44444
## Apr 2017 3361.930 3077.143 -128.32442  413.11111
## May 2017 3284.420 2881.538 -195.60472  598.48611
## Jun 2017 2810.547 2793.841  -87.69709  104.40278
## Jul 2017 3093.185 2851.023   57.18162  184.98057
## Aug 2017 3288.938 2998.999  147.97642  141.96220
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

##                    ME     RMSE      MAE       MPE     MAPE      MASE      ACF1
## Training set 7.270737 233.2177 218.2107 0.1472528 6.987008 0.4531895 0.3437704
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Sep 2017       3248.897 2938.885 3558.910 2774.775 3723.020
## Oct 2017       3609.514 3222.398 3996.629 3017.471 4201.556
## Nov 2017       3064.943 2544.560 3585.327 2269.086 3860.801
## Dec 2017       2387.129 1690.133 3084.125 1321.165 3453.092
## Jan 2018       2848.577 1942.073 3755.080 1462.199 4234.954
## Feb 2018       3031.974 1889.385 4174.562 1284.535 4779.412
## Mar 2018       3747.504 2346.090 5148.918 1604.226 5890.782
## Apr 2018       3985.315 2304.844 5665.785 1415.257 6555.372
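
In the fitted table above, each fitted value is the sum of the level, trend, and seasonal components, which indicates an additive Holt-Winters model. For July 2016, 2725.259 + 38.14394 - 35.38889 = 2728.014. A minimal base R sketch of such a fit on simulated data follows. The series and the fixed smoothing parameters are assumptions chosen for illustration, not the values estimated in the original analysis.

```r
set.seed(7)
# Simulated 26-month series, NOT the hotel data
months <- 1:26
sim <- ts(3000 + 400 * sin(2 * pi * months / 12) + rnorm(26, sd = 100),
          start = c(2015, 7), frequency = 12)

# Additive Holt-Winters with fixed (hypothetical) smoothing parameters;
# leaving alpha, beta and gamma unset lets HoltWinters() estimate them
hw <- HoltWinters(sim, alpha = 0.3, beta = 0.1, gamma = 0.1,
                  seasonal = "additive")

fit <- fitted(hw)   # columns: xhat, level, trend, season
print(head(fit, 3))

# Each fitted value equals level + trend + season
recomposed <- fit[, "level"] + fit[, "trend"] + fit[, "season"]

# Eight-step-ahead forecast with a 95% prediction interval
fc <- predict(hw, n.ahead = 8, prediction.interval = TRUE, level = 0.95)
print(round(fc, 1))
```

With 26 monthly observations and a 12-month season, the first fitted value falls in the 13th month, which matches the July 2016 start of the table above.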

Conclusions and recommendations